Abstract

Most of the non-asymptotic theoretical work in regression is carried out for the 
square loss, where estimators can be obtained through closed-form expressions. In 
this paper, we use and extend tools from the convex optimization literature, namely 
self-concordant functions, to provide simple extensions of theoretical results for the 
square loss to the logistic loss. We apply the extension techniques to logistic regression
with regularization by the $\ell_2$-norm and regularization by the $\ell_1$-norm, showing that new
results for binary classification through logistic regression can be easily derived from 
corresponding results for least-squares regression. 

1 Introduction 

The theoretical analysis of statistical methods is usually greatly simplified when the esti- 
mators have closed-form expressions. For methods based on the minimization of a certain 
functional, such as M-estimation methods [1], this is true when the function to minimize is
quadratic, i.e., in the context of regression, for the square loss. 

When such loss is used, asymptotic and non-asymptotic results may be derived with 
classical tools from probability theory (see, e.g., 13). When the function which is minimized 
in M-estimation is not amenable to closed-form solutions, local approximations are then 
needed for obtaining and analyzing a solution of the optimization problem. In the asymptotic 
regime, this has led to interesting developments and extensions of results from the quadratic 
case, e.g., consistency or asymptotic normality (see, e.g., HI). However, the situation is 
different when one wishes to derive non-asymptotic results, i.e., results where all constants 
of the problem are explicit. Indeed, in order to prove results as sharp as for the square
loss, much notation and many assumptions have to be introduced regarding second and third 
derivatives; this makes the derived results much more complicated than the ones for closed-
form estimators Ellllsl.

A similar situation occurs in convex optimization, for the study of Newton's method 
for obtaining solutions of unconstrained optimization problems. It is known to be locally 
quadratically convergent for convex problems. However, its classical analysis requires cum- 
bersome notations and assumptions regarding second and third-order derivatives (see, e.g., O 
|71). This situation was greatly enhanced with the introduction of the notion of self-concordant 
functions, i.e., functions whose third derivatives are controlled by their second derivatives. 
With this tool, the analysis is much more transparent ||7l HI. While Newton's method is 
a commonly used algorithm for logistic regression (see, e.g., ||9l [13), leading to iterative 
least-squares algorithms, we do not focus in this paper on solving the optimization
problems, but on the statistical analysis of the associated global minimizers.

In this paper, we aim to borrow tools from convex optimization and self-concordance to
analyze the statistical properties of logistic regression. Since the logistic loss is not itself a
self-concordant function, we introduce in Section 2 a new type of function with a different
control of the third derivatives. For these functions, we prove two types of results: first,
we provide lower and upper Taylor expansions, i.e., Taylor expansions which globally
upper-bound or lower-bound a given function. Second, we prove results on the behavior
of Newton's method which are similar to the ones for self-concordant functions. We
then apply them in Sections 3, 4 and 5 to the one-step Newton iterate from the population
solution of the corresponding problem (i.e., $\ell_2$- or $\ell_1$-regularized logistic regression).
This essentially shows that the analysis of logistic regression can be done non-asymptotically using
the local quadratic approximation of the logistic loss, without complex additional assumptions.
Since this approximation corresponds to a weighted least-squares problem, results
from least-squares regression can thus be naturally extended.

In order to consider such extensions and make sure that the new results closely match the
corresponding ones for least-squares regression, we derive in Appendix G new Bernstein-like
concentration inequalities for quadratic forms of bounded random variables, obtained from
general results on U-statistics lOTI .

We first apply in Section 4 the extension technique to regularization by the $\ell_2$-norm,
where we consider two settings: a situation with no assumptions regarding the conditional
distribution of the observations, and another one where the model is assumed well-specified
and we derive asymptotic expansions of the generalization performance with explicit bounds
on remainder terms. In Section 5, we consider regularization by the $\ell_1$-norm and extend two
known recent results for the square loss, one on model consistency [12, 13, 14, 15] and one
on prediction efficiency [16]. The main contribution of this paper is to make these extensions
as simple as possible, by allowing the use of non-asymptotic second-order Taylor expansions.

Notation. For $x \in \mathbb{R}^p$ and $q \geq 1$, we denote by $\|x\|_q$ the $\ell_q$-norm of $x$, defined as
$\|x\|_q^q = \sum_{i=1}^p |x_i|^q$. We also denote by $\|x\|_\infty = \max_{i \in \{1,\dots,p\}} |x_i|$ its $\ell_\infty$-norm.
We denote by $\lambda_{\max}(Q)$ and $\lambda_{\min}(Q)$ the largest and smallest eigenvalue of a symmetric matrix $Q$.
We use the notation $Q_1 \preccurlyeq Q_2$ (resp. $Q_1 \succcurlyeq Q_2$) for the positive semi-definiteness of the matrix
$Q_2 - Q_1$ (resp. $Q_1 - Q_2$).

For $a \in \mathbb{R}$, $\mathrm{sign}(a)$ denotes the sign of $a$, defined as $\mathrm{sign}(a) = 1$ if $a > 0$, $-1$ if $a < 0$,
and $0$ if $a = 0$. For a vector $v \in \mathbb{R}^p$, $\mathrm{sign}(v) \in \{-1, 0, 1\}^p$ denotes the vector of signs of
elements of $v$.

Moreover, given a vector $v \in \mathbb{R}^p$ and a subset $I$ of $\{1, \dots, p\}$, $|I|$ denotes the cardinal of
the set $I$, and $v_I$ denotes the vector in $\mathbb{R}^{|I|}$ of elements of $v$ indexed by $I$. Similarly, for a matrix
$A \in \mathbb{R}^{p \times p}$, $A_{IJ}$ denotes the submatrix of $A$ composed of elements of $A$ whose rows are in
$I$ and columns are in $J$. Finally, we let $\mathbb{P}$ and $\mathbb{E}$ denote general probability measures and
expectations.

2 Taylor expansions and Newton's method 

In this section, we consider a generic function $F : \mathbb{R}^p \to \mathbb{R}$, which is convex and three times
differentiable. We denote by $F'(w) \in \mathbb{R}^p$ its gradient at $w \in \mathbb{R}^p$, and by $F''(w) \in \mathbb{R}^{p \times p}$ its
Hessian at $w \in \mathbb{R}^p$. We denote by $\lambda(w) \geq 0$ the smallest eigenvalue of the Hessian $F''(w)$
at $w \in \mathbb{R}^p$.

If $\lambda(w) > 0$, i.e., the Hessian is invertible at $w$, we can define the Newton step as
$\Delta^N(w) = -F''(w)^{-1} F'(w)$, and the Newton decrement $\nu(F, w)$ at $w$, defined through:

$$\nu(F, w)^2 = F'(w)^\top F''(w)^{-1} F'(w) = \Delta^N(w)^\top F''(w) \Delta^N(w).$$

The one-step Newton iterate $w + \Delta^N(w)$ is the minimizer of the second-order Taylor expansion
of $F$ at $w$, i.e., of the function $v \mapsto F(w) + F'(w)^\top (v - w) + \frac{1}{2} (v - w)^\top F''(w) (v - w)$.
Newton's method consists in successively applying the same iteration until convergence. For
more background and details about Newton's method, see, e.g., |[7ll6l[T7l.
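As a minimal illustration of the Newton step and the Newton decrement defined in this section, the following sketch evaluates both for a small strictly convex quadratic. The quadratic, the helper name `newton_step_and_decrement` and the numerical values are assumptions made for the example; they do not come from the paper.

```python
import numpy as np

def newton_step_and_decrement(grad, hess):
    """Return the Newton step -H^{-1} g and the Newton decrement sqrt(g^T H^{-1} g)."""
    # Solve H d = -g instead of forming the inverse explicitly.
    step = np.linalg.solve(hess, -grad)
    # g^T H^{-1} g equals -g^T d; clip at zero to guard against rounding.
    decrement = np.sqrt(max(float(-(grad @ step)), 0.0))
    return step, decrement

# Hypothetical quadratic F(w) = 0.5 * w^T A w - b^T w (illustration only).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w = np.array([0.5, 0.5])

grad = A @ w - b   # F'(w)
hess = A           # F''(w)
step, nu = newton_step_and_decrement(grad, hess)
w_next = w + step  # one-step Newton iterate
```

For a quadratic function the second-order Taylor expansion is exact, so the one-step iterate `w_next` is already the global minimizer; for the logistic loss it is only an approximation, and the results of this section quantify its quality.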

2.1 Self-concordant functions 

We now review some important properties of self-concordant functions ItKH, i.e., three times
differentiable convex functions such that for all $u, v \in \mathbb{R}^p$, the function $g : t \mapsto F(u + tv)$
satisfies for all $t \in \mathbb{R}$, $|g'''(t)| \leq 2 g''(t)^{3/2}$.

The local behavior of self-concordant functions is well-studied and lower and upper Taylor
expansions can be derived (similar to the ones we derive in Proposition 1). Moreover,
bounds are available for the behavior of Newton's method; given a self-concordant function
$F$, if $w \in \mathbb{R}^p$ is such that $\nu(F, w) \leq 1/4$, then $F$ attains its unique global minimum at some
$w^* \in \mathbb{R}^p$, and we have the following bound on the error $w - w^*$ (see, e.g., HI):

$$(w - w^*)^\top F''(w) (w - w^*) \leq 4 \nu(F, w)^2. \tag{1}$$

Moreover, the Newton decrement at the one-step Newton iterate from $w \in \mathbb{R}^p$ can be
upper-bounded as follows:

$$\nu(F, w + \Delta^N(w)) \leq \nu(F, w)^2, \tag{2}$$

which allows us to prove an upper bound on the error of the one-step iterate, by application of
Eq. (1) to $w + \Delta^N(w)$. Note that these bounds are not the sharpest, but are sufficient in our
context. These are commonly used to show the global convergence of the damped Newton's
method HI or of Newton's method with backtracking line search [7], as well as a precise
upper bound on the number of iterations to reach a given precision.

Note that in the context of machine learning and statistics, self-concordant functions have 
been used for bandit optimization and online learning fTSl, but for barrier functions related 
to constrained optimization problems, and not directly for M-estimation. 



2.2 Modifications of self-concordant functions 

The logistic function $u \mapsto \log(1 + e^{-u})$ is not self-concordant as the third derivative is
bounded by a constant times the second derivative (without the power 3/2). However, similar
bounds can be derived with a different control of the third derivatives. Proposition 1 provides
lower and upper Taylor expansions while Proposition 2 considers the behavior of Newton's
method. Proofs may be found in Appendix A and follow closely the ones for regular
self-concordant functions found in jSl .

Proposition 1 (Taylor expansions) Let $F : \mathbb{R}^p \to \mathbb{R}$ be a convex three times differentiable
function such that for all $w, v \in \mathbb{R}^p$, the function $g(t) = F(w + tv)$ satisfies for all $t \in \mathbb{R}$,
$|g'''(t)| \leq R \|v\|_2 \times g''(t)$, for some $R \geq 0$. We then have for all $w, v, z \in \mathbb{R}^p$:

$$F(w + v) \geq F(w) + v^\top F'(w) + \frac{v^\top F''(w) v}{R^2 \|v\|_2^2} \left( e^{-R\|v\|_2} + R\|v\|_2 - 1 \right), \tag{3}$$

$$F(w + v) \leq F(w) + v^\top F'(w) + \frac{v^\top F''(w) v}{R^2 \|v\|_2^2} \left( e^{R\|v\|_2} - R\|v\|_2 - 1 \right), \tag{4}$$

$$\frac{z^\top [F'(w + v) - F'(w) - F''(w) v]}{[z^\top F''(w) z]^{1/2}} \leq [v^\top F''(w) v]^{1/2} \, \frac{e^{R\|v\|_2} - 1 - R\|v\|_2}{R\|v\|_2}, \tag{5}$$

$$e^{-R\|v\|_2} F''(w) \preccurlyeq F''(w + v) \preccurlyeq e^{R\|v\|_2} F''(w). \tag{6}$$

Inequalities in Eq. (3) and Eq. (4) provide lower and upper second-order Taylor expansions
of $F$, while Eq. (5) provides a first-order Taylor expansion of $F'$ and Eq. (6) can be considered
as an upper and lower zero-order Taylor expansion of $F''$. Note the difference here
between Eqs. (3)-(4) and regular third-order Taylor expansions of $F$: the remainder term in the
Taylor expansion, i.e., $F(w + v) - F(w) - v^\top F'(w) - \frac{1}{2} v^\top F''(w) v$, is upper-bounded by
$\frac{v^\top F''(w) v}{R^2 \|v\|_2^2} \left( e^{R\|v\|_2} - \frac{1}{2} R^2 \|v\|_2^2 - R\|v\|_2 - 1 \right)$; for $\|v\|_2$ small, we obtain a term proportional
to $\|v\|_2^3$ (like a regular local Taylor expansion), but the bound remains valid for all $v$ and does
not grow as fast as a third-order polynomial. Moreover, a regular Taylor expansion with a
uniformly bounded third-order derivative would lead to a bound proportional to $\|v\|_2^3$, which
does not take into account the local curvature of $F$ at $w$. Taking into account this local
curvature is key to obtaining sharp and simple bounds on the behavior of Newton's method (see
proof in Appendix A):
Proposition 2 (Behavior of Newton's method) Let $F : \mathbb{R}^p \to \mathbb{R}$ be a convex three times
differentiable function such that for all $w, v \in \mathbb{R}^p$, the function $g(t) = F(w + tv)$ satisfies
for all $t \in \mathbb{R}$, $|g'''(t)| \leq R \|v\|_2 \times g''(t)$, for some $R \geq 0$. Let $\lambda(w) > 0$ be the lowest
eigenvalue of $F''(w)$ for some $w \in \mathbb{R}^p$. If $\nu(F, w) \leq \frac{\lambda(w)^{1/2}}{2R}$, then $F$ has a unique global
minimizer $w^* \in \mathbb{R}^p$ and we have:

$$(w - w^*)^\top F''(w) (w - w^*) \leq 16 \, \nu(F, w)^2, \tag{7}$$

$$\frac{R \, \nu(F, w + \Delta^N(w))}{\lambda(w + \Delta^N(w))^{1/2}} \leq \left( \frac{R \, \nu(F, w)}{\lambda(w)^{1/2}} \right)^2, \tag{8}$$

$$(w + \Delta^N(w) - w^*)^\top F''(w) (w + \Delta^N(w) - w^*) \leq \frac{16 R^2}{\lambda(w)} \, \nu(F, w)^4. \tag{9}$$

Eq. (7) extends Eq. (1) while Eq. (8) extends Eq. (2). Note that the notion and the results
are not invariant by affine transform (contrary to self-concordant functions) and that we still
need a (non-uniformly) lower-bounded Hessian. The last two propositions constitute the
main technical contribution of this paper. We now apply these to logistic regression and its
regularized versions.



3 Application to logistic regression 



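We consider $n$ pairs of observations $(x_i, y_i)$ in $\mathbb{R}^p \times \{-1, 1\}$ and the following objective
function for logistic regression:

$$\hat{J}_0(w) = \frac{1}{n} \sum_{i=1}^n \log\left[ 1 + \exp(-y_i w^\top x_i) \right] = \frac{1}{n} \sum_{i=1}^n \left\{ \ell(w^\top x_i) - \frac{y_i}{2} w^\top x_i \right\}, \tag{10}$$

where $\ell : u \mapsto \log(e^{-u/2} + e^{u/2})$ is an even convex function. A short calculation leads to
$\ell'(u) = -1/2 + \sigma(u)$, $\ell''(u) = \sigma(u)[1 - \sigma(u)]$, $\ell'''(u) = \sigma(u)[1 - \sigma(u)][1 - 2\sigma(u)]$, where
$\sigma(u) = (1 + e^{-u})^{-1}$ is the sigmoid function. Note that we have for all $u \in \mathbb{R}$, $|\ell'''(u)| \leq \ell''(u)$.
The cost function $\hat{J}_0$ defined in Eq. (10) is proportional to the negative conditional
log-likelihood of the data under the conditional model $\mathbb{P}(y_i = \varepsilon_i | x_i) = \sigma(\varepsilon_i w^\top x_i)$.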
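The derivatives of the function u -> log(exp(-u/2) + exp(u/2)) and the third-derivative bound stated above can be checked numerically. The sketch below is only an illustration; the function names (`ell`, `ell_d1`, `ell_d2`, `ell_d3`) and the evaluation grid are assumptions made for the example.

```python
import numpy as np

def sigmoid(u):
    """Sigmoid function sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def ell(u):
    """ell(u) = log(exp(-u/2) + exp(u/2)), evaluated stably."""
    return np.logaddexp(-u / 2.0, u / 2.0)

def ell_d1(u):
    """First derivative: sigma(u) - 1/2."""
    return sigmoid(u) - 0.5

def ell_d2(u):
    """Second derivative: sigma(u) * (1 - sigma(u))."""
    s = sigmoid(u)
    return s * (1.0 - s)

def ell_d3(u):
    """Third derivative: sigma(u) * (1 - sigma(u)) * (1 - 2 * sigma(u))."""
    s = sigmoid(u)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

u = np.linspace(-10.0, 10.0, 2001)

# Largest value of |ell'''(u)| - ell''(u) on the grid; it should not be positive.
max_violation = np.max(np.abs(ell_d3(u)) - ell_d2(u))

# Gap between the two forms of the per-observation loss, for y = +1 and y = -1.
gap = max(
    np.max(np.abs(np.log1p(np.exp(-y * u)) - (ell(u) - y * u / 2.0)))
    for y in (1.0, -1.0)
)
```

The bound on the third derivative follows from the factor 1 - 2 sigma(u), whose absolute value never exceeds one; the second check confirms that the two expressions of the logistic loss agree for both labels.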

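If $R = \max_{i \in \{1,\dots,n\}} \|x_i\|_2$ denotes the maximum $\ell_2$-norm of all input data points, then
the cost function $\hat{J}_0$ defined in Eq. (10) satisfies the assumptions of Proposition 2. Indeed,
we have, with the notations of Proposition 2,

$$|g'''(t)| = \left| \frac{1}{n} \sum_{i=1}^n \ell'''[(w + tv)^\top x_i] \, (x_i^\top v)^3 \right| \leq \frac{1}{n} \sum_{i=1}^n \ell''[(w + tv)^\top x_i] \, (x_i^\top v)^2 \, \|v\|_2 \|x_i\|_2 \leq R \|v\|_2 \times g''(t).$$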
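Since the empirical logistic objective satisfies the assumption of Proposition 1 with R equal to the largest input norm, the lower and upper second-order Taylor bounds of that proposition can be evaluated directly. The sketch below does this on synthetic data; the data, the random seed and the helper names (`J0_hat`, `grad`, `hess`, `taylor_bounds`) are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))              # hypothetical inputs x_1, ..., x_n
y = rng.choice([-1.0, 1.0], size=n)      # hypothetical labels in {-1, 1}
R = np.max(np.linalg.norm(X, axis=1))    # largest l2-norm of the inputs

def J0_hat(w):
    """Empirical logistic objective (1/n) sum_i log(1 + exp(-y_i w^T x_i))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def grad(w):
    """Gradient (1/n) sum_i (sigma(w^T x_i) - 1/2 - y_i/2) x_i."""
    s = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (s - 0.5 - y / 2.0) / n

def hess(w):
    """Hessian (1/n) sum_i sigma_i (1 - sigma_i) x_i x_i^T."""
    s = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (X.T * (s * (1.0 - s))) @ X / n

def taylor_bounds(w, v):
    """Lower and upper bounds on J0_hat(w + v) from the modified Taylor expansions."""
    r = R * np.linalg.norm(v)
    linear = J0_hat(w) + v @ grad(w)
    curvature = (v @ hess(w) @ v) / r ** 2
    lower = linear + curvature * (np.exp(-r) + r - 1.0)
    upper = linear + curvature * (np.exp(r) - r - 1.0)
    return lower, upper

w = rng.normal(size=p)
v = rng.normal(size=p)
lower, upper = taylor_bounds(w, v)
value = J0_hat(w + v)   # should lie between lower and upper
```

Both bounds use only the gradient and Hessian at `w`, which is what makes the analysis of the one-step Newton iterate possible without a global lower bound on the curvature.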



Throughout this paper, we will consider a certain vector $w \in \mathbb{R}^p$ (usually defined through
the population functional) and consider the one-step Newton iterate from this $w$. Results
from Section 2.2 will allow us to show that this approximates the global minimum of $\hat{J}_0$ or a
regularized version thereof.

Throughout this paper, we consider a fixed design setting (i.e., $x_1, \dots, x_n$ are considered
deterministic) and we make the following assumptions:

(A1) Independent outputs: The outputs $y_i \in \{-1, 1\}$, $i = 1, \dots, n$, are independent (but not
identically distributed).

(A2) Bounded inputs: $\max_{i \in \{1,\dots,n\}} \|x_i\|_2 \leq R$.

We define the model as well-specified if there exists $w_0 \in \mathbb{R}^p$ such that for all $i =
1, \dots, n$, $\mathbb{P}(y_i = \varepsilon_i) = \sigma(\varepsilon_i w_0^\top x_i)$, which is equivalent to $\mathbb{E}(y_i/2) = \ell'(w_0^\top x_i)$, and implies
$\mathrm{var}(y_i/2) = \ell''(w_0^\top x_i)$. However, we do not always make such assumptions in the paper.

We use the matrix notation $X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times p}$ for the design matrix and $\varepsilon_i =
y_i/2 - \mathbb{E}(y_i/2)$, for $i = 1, \dots, n$, which formally corresponds to the additive noise in
least-squares regression. We also use the notation $Q = \frac{1}{n} X^\top \mathrm{Diag}(\mathrm{var}(y_i/2)) X \in \mathbb{R}^{p \times p}$ and
$q = \frac{1}{n} X^\top \varepsilon \in \mathbb{R}^p$. By assumption, we have $\mathbb{E}(q q^\top) = \frac{1}{n} Q$.

We denote by $J_0$ the expectation of $\hat{J}_0$, i.e.:

$$J_0(w) = \mathbb{E}[\hat{J}_0(w)] = \frac{1}{n} \sum_{i=1}^n \left\{ \ell(w^\top x_i) - \mathbb{E}(y_i/2) \, w^\top x_i \right\}.$$

Note that with our notation, $\hat{J}_0(w) = J_0(w) - q^\top w$. In this paper we consider $J_0(\hat{w})$
as the generalization performance of a certain estimator $\hat{w}$. This corresponds to the average
Kullback-Leibler divergence to the best model when the model is well-specified, and
is common for the study of logistic regression and more generally generalized linear models
[19, 20]. Measuring the classification performance through the 0-1 loss [21] is out of the
scope of this paper.

The function $J_0$ is bounded from below, therefore it has a bounded infimum $\inf_{w \in \mathbb{R}^p} J_0(w) \geq 0$.
This infimum might or might not be attained at a finite $w_0 \in \mathbb{R}^p$; when the model is
well-specified, it is always attained (but this is not a necessary condition), and, unless the design
matrix $X$ has rank $p$, the minimizer is not unique.

The difference between the analysis through self-concordance and the classical asymptotic
analysis is best seen when the model is well-specified, and exactly mimics the difference
between self-concordant analysis of Newton's method and its classical analysis. The usual
analysis of logistic regression requires that the logistic function $u \mapsto \log(1 + e^{-u})$ is strongly
convex (i.e., with a strictly positive lower-bound on the second derivative), which is true only
on a compact subset of $\mathbb{R}$. Thus, non-asymptotic results such as the ones from IHO require
an upper bound $M$ on $|w_0^\top x_i|$, where $w_0$ is the generating loading vector; then, the second

derivative of the logistic loss is lower bounded by (1 + e*^)~^, and this lower bound may be 
very small when M gets large. Our analysis does not require such a bound because of the 
fine control of the third derivative. 
4 Regularization by the $\ell_2$-norm

We denote by $\hat{J}_\lambda(w) = \hat{J}_0(w) + \frac{\lambda}{2} \|w\|_2^2$ the empirical $\ell_2$-regularized functional. For $\lambda > 0$,
the function $\hat{J}_\lambda$ is strongly convex and we denote by $\hat{w}_\lambda$ the unique global minimizer
of $\hat{J}_\lambda$. In this section, our goal is to find upper and lower bounds on the generalization
performance $J_0(\hat{w}_\lambda)$, under minimal assumptions (Section 4.2) or when the model is
well-specified (Section 4.3).

4.1 Reproducing kernel Hilbert spaces and splines 

In this paper we focus explicitly on linear logistic regression, i.e., on a generalized linear
model that allows linear dependency between $x_i$ and the distribution of $y_i$. Although
apparently limiting, in the context of regularization by the $\ell_2$-norm, this setting contains
non-parametric and non-linear methods based on splines or reproducing kernel Hilbert spaces
(RKHS) [22]. Indeed, because of the representer theorem [23], minimizing the cost function

$$\frac{1}{n} \sum_{i=1}^n \left\{ \ell[f(x_i)] - \frac{y_i}{2} f(x_i) \right\} + \frac{\lambda}{2} \|f\|_{\mathcal{F}}^2,$$

with respect to the function $f$ in the RKHS $\mathcal{F}$ (with norm $\|\cdot\|_{\mathcal{F}}$ and kernel $k$), is equivalent
to minimizing the cost function

$$\frac{1}{n} \sum_{i=1}^n \left\{ \ell[(T\beta)_i] - \frac{y_i}{2} (T\beta)_i \right\} + \frac{\lambda}{2} \|\beta\|_2^2, \tag{11}$$

with respect to $\beta \in \mathbb{R}^p$, where $T \in \mathbb{R}^{n \times p}$ is a square root of the kernel matrix $K \in \mathbb{R}^{n \times n}$
defined as $K_{ij} = k(x_i, x_j)$, i.e., such that $K = T T^\top$. The unique solution of the original
problem $f$ is then obtained as $f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$, where $\alpha$ is any vector satisfying
$T T^\top \alpha = T \beta$ (which can be obtained by matrix pseudo-inversion [24]). Similar developments
can be carried out for smoothing splines (see, e.g., [22, 25]). By identifying the matrix
$T$ with the data matrix $X$, the optimization problem in Eq. (11) is identical to minimizing
$\hat{J}_0(w) + \frac{\lambda}{2} \|w\|_2^2$, and thus our results apply to estimation in RKHSs.
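The change of variables between the kernel formulation and the linear one can be sketched numerically: build a kernel matrix, take one of its square roots T, and recover kernel expansion coefficients from a vector in the linear parameterization by pseudo-inversion. The Gaussian kernel, the sample points and the variable names below are assumptions made for the illustration, and the square root chosen here is square, so the linear problem has as many coefficients as observations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
pts = rng.normal(size=(n, 2))   # hypothetical inputs x_1, ..., x_n

# Kernel matrix K_ij = k(x_i, x_j); the Gaussian kernel is an assumption.
sq_dists = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

# One square root T with K = T T^T, from the eigendecomposition of K.
eigvals, eigvecs = np.linalg.eigh(K)
T = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

# Any coefficient vector in the linear parameterization (arbitrary here).
beta = rng.normal(size=n)

# A vector alpha satisfying T T^T alpha = T beta, via the pseudo-inverse of K.
alpha = np.linalg.pinv(K) @ (T @ beta)

# Values of f(x) = sum_i alpha_i k(x, x_i) at the data points.
f_at_data = K @ alpha
```

At the data points, the kernel expansion reproduces the linear predictions `T @ beta`, which is why the regularized problem over the RKHS and the problem in Eq. (11) share the same fitted values.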



4.2 Minimal assumptions (misspecified model) 

In this section, we do not assume that the model is well-specified. We obtain the following
theorem (see proof in Appendix B), which only assumes boundedness of the covariates and
independence of the outputs:

Theorem 1 (Misspecified model) Assume (A1), (A2) and $\lambda = 19 R^2 \sqrt{\frac{\log(8/\delta)}{n}}$, with $\delta \in
(0, 1)$. Then, with probability at least $1 - \delta$, for all $w_0 \in \mathbb{R}^p$,

$$J_0(\hat{w}_\lambda) \leq J_0(w_0) + \left( 10 + 100 R^2 \|w_0\|_2^2 \right) \sqrt{\frac{\log(8/\delta)}{n}}. \tag{12}$$
In particular, if the global minimum of $J_0$ is attained at $w_0$ (which is not an assumption
of Theorem 1), we obtain an oracle inequality as $J_0(w_0) = \inf_{w \in \mathbb{R}^p} J_0(w)$. The lack of
additional assumptions unsurprisingly gives rise to a slow rate of $n^{-1/2}$.

This is to be compared with [26], which uses different proof techniques but obtains similar
results for all convex Lipschitz-continuous losses (and not only for the logistic loss).
However, the techniques presented in this paper allow the derivation of much more precise
statements in terms of bias and variance (and with better rates), which involve some knowledge
of the problem. We do not pursue detailed results here, but focus in the next section on
well-specified models, where results have a simpler form.

This highlights two opposite strategies for the theoretical analysis of regularized problems:
the first one, followed by [26, 27], is mostly loss-independent and relies on advanced
tools from empirical process theory, namely uniform concentration inequalities. Results are
widely applicable and make very few assumptions. However, they tend to give performance
guarantees which are far below the observed performances of such methods in applications.
The second strategy, which we follow in this paper, is to restrict the loss class (to linear or
logistic) and derive the limiting convergence rate, which does depend on unknown constants
(typically the best linear classifier itself). Once the limit is obtained, we believe it gives a
better interpretation of the performance of these methods; and if one really wishes to make
no assumption, taking upper bounds on these quantities recovers the results obtained
with the generic strategy, which is exactly what Theorem 1 achieves.

Thus, a detailed analysis of the convergence rate, as done in Theorem 2 in the next section,
serves two purposes: first, it gives a sharp result that depends on unknown constants;
second, the constants can be maximized out and more general results may be obtained, with
fewer assumptions but worse convergence rates.



4.3 Well-specified models 

We now assume that the model is well-specified, i.e., that the probability that $y_i = 1$ is a
sigmoid function of a linear function of $x_i$, which is equivalent to:

(A3) Well-specified model: There exists $w_0 \in \mathbb{R}^p$ such that $\mathbb{E}(y_i/2) = \ell'(w_0^\top x_i)$.

Theorem 2 will give upper and lower bounds on the expected risk of the $\ell_2$-regularized
estimator $\hat{w}_\lambda$, i.e., $J_0(\hat{w}_\lambda)$. We use the following definitions for the two degrees of freedom
and biases, which are usual in the context of ridge regression and spline smoothing (see,
e.g., miHIIl):

degrees of freedom (1): $d_1 = \mathrm{tr}\, Q (Q + \lambda I)^{-1}$,

degrees of freedom (2): $d_2 = \mathrm{tr}\, Q^2 (Q + \lambda I)^{-2}$,

bias (1): $b_1 = \lambda^2 w_0^\top (Q + \lambda I)^{-1} w_0$,

bias (2): $b_2 = \lambda^2 w_0^\top Q (Q + \lambda I)^{-2} w_0$.

Note that we always have the inequalities $d_2 \leq d_1 \leq \min\{R^2/\lambda, n\}$ and $b_2 \leq b_1 \leq
\min\{\lambda \|w_0\|_2^2, \lambda^2 w_0^\top Q^{-1} w_0\}$, and that these quantities depend on $\lambda$. In the context of RKHSs
outlined in Section 4.1, we have $d_1 = \mathrm{tr}\, K (K + n\lambda\, \mathrm{Diag}(\sigma_i^{-2}))^{-1}$ (with $\sigma_i^2 = \mathrm{var}(y_i/2)$), a quantity which is
also usually referred to as the degrees of freedom [29]. In the context of the analysis of
$\ell_2$-regularized methods, the two degrees of freedom are necessary, as outlined in Theorems 2
and 3 and in [28].
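The two degrees of freedom and the two bias terms can be computed directly from the weighted Gram matrix, and the inequalities between them checked numerically. In the sketch below, the design, the generating vector, the regularization parameter and all variable names are assumptions made for the illustration; the weights are the conditional variances under a well-specified logistic model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 5, 0.1
X = rng.normal(size=(n, p))             # hypothetical fixed design
w0 = rng.normal(size=p)                 # hypothetical generating vector
R = np.max(np.linalg.norm(X, axis=1))   # bound on the input norms

# Conditional variances var(y_i / 2) = sigma(u_i) * (1 - sigma(u_i)), u_i = w0^T x_i.
s = 1.0 / (1.0 + np.exp(-(X @ w0)))
variances = s * (1.0 - s)

# Weighted Gram matrix Q = (1/n) X^T Diag(var(y_i / 2)) X.
Q = (X.T * variances) @ X / n
M = np.linalg.inv(Q + lam * np.eye(p))  # (Q + lam I)^{-1}, which commutes with Q

d1 = np.trace(Q @ M)                    # tr Q (Q + lam I)^{-1}
d2 = np.trace(Q @ M @ Q @ M)            # tr Q^2 (Q + lam I)^{-2}
b1 = lam ** 2 * (w0 @ M @ w0)           # lam^2 w0^T (Q + lam I)^{-1} w0
b2 = lam ** 2 * (w0 @ M @ Q @ M @ w0)   # lam^2 w0^T Q (Q + lam I)^{-2} w0
```

Because the variances never exceed 1/4, the matrix `Q` is dominated by a quarter of the unweighted Gram matrix, which is one way to see why these degrees of freedom are no larger than their least-squares counterparts.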

Moreover, we denote by $\kappa > 0$ the following quantity

$$\kappa = \frac{R}{\lambda^{1/2}} \left( \frac{d_1}{n} + b_1 \right) \left( \frac{d_2}{n} + b_2 \right)^{-1/2}. \tag{13}$$

Such a quantity is an extension of the one used by [30] in the context of kernel Fisher
discriminant analysis used as a test for homogeneity. In order to obtain asymptotic equivalents, we
require $\kappa$ to be small, which, as shown later in this section, occurs in many interesting cases
when $n$ is large enough.

In this section, we will apply results from Section 2 to the functions $\hat{J}_\lambda$ and $J_0$. Essentially,
we will consider local quadratic approximations of these functions around the generating
loading vector $w_0$, leading to replacing the true estimator $\hat{w}_\lambda$ by the one-step Newton
iterate from $w_0$. This is only possible if the Newton decrement $\nu(\hat{J}_\lambda, w_0)$ is small enough,
which leads to additional constraints (in particular the upper bound on $\kappa$).

Theorem 2 (Asymptotic generalization performance) Assume (A1), (A2) and (A3). Assume
moreover $\kappa \leq 1/16$, where $\kappa$ is defined in Eq. (13). If $v \in [0, 1/4]$ satisfies
$v^3 (d_2 + n b_2)^{1/2} \leq 12$, then, with probability at least $1 - \exp(-v^2 (d_2 + n b_2))$:

$$\left| J_0(\hat{w}_\lambda) - J_0(w_0) - \frac{1}{2} \left( b_2 + \frac{d_2}{n} \right) \right| \leq \left( b_2 + \frac{d_2}{n} \right) (69 v + 2560 \kappa). \tag{14}$$



Relationship to previous work. When the dimension $p$ of $w_0$ is bounded, then under the
regular asymptotic regime ($n$ tends to $+\infty$), $J_0(\hat{w}_\lambda)$ has the following expansion
$J_0(w_0) + \frac{1}{2} \left( b_2 + \frac{d_2}{n} \right)$, a result which has been obtained by several authors in several settings [31, 32].
In this asymptotic regime, the optimal $\lambda$ is known to be of order $O(n^{-1})$ [33]. The main
contribution of our analysis is to allow a non-asymptotic analysis with explicit constants.
Moreover, note that for the square loss, the bound in Eq. (14) holds with $\kappa = 0$, which can
be linked to the fact that our self-concordant analysis from Propositions 1 and 2 is applicable
with $R = 0$ for the square loss. Note that the constants in the previous theorem could
probably be improved.

Conditions for asymptotic equivalence. In order to have the remainder term in Eq. (14)
negligible with high probability compared to the lowest order term in the expansion of
$J_0(\hat{w}_\lambda)$, we need to have $d_2 + n b_2$ large and $\kappa$ small (so that $v$ can be taken small
while $v^2 (d_2 + n b_2)$ is large, and hence we have a result with high probability). The assumption
that $d_2 + n b_2$ grows unbounded when $n$ tends to infinity is a classical assumption in the
study of smoothing splines and RKHSs [34, 35], and simply states that the convergence rate
of the excess risk $J_0(\hat{w}_\lambda) - J_0(w_0)$, i.e., $b_2 + d_2/n$, is slower than for parametric estimation,
i.e., slower than $n^{-1}$.
Study of parameter $\kappa$. First, we always have $\kappa \geq \frac{R}{\lambda^{1/2}} \left( \frac{d_1}{n} + b_1 \right)^{1/2}$; thus an upper bound
on $\kappa$ implies an upper bound on $\frac{d_1}{n} + b_1$, which is needed in the proof of Theorem 2 to show
that the Newton decrement is small enough. Moreover, $\kappa$ is bounded by the sum of
$\kappa_{\mathrm{bias}} = \frac{R}{\lambda^{1/2}} b_1 b_2^{-1/2}$ and $\kappa_{\mathrm{var}} = \frac{R}{\lambda^{1/2}} \left( \frac{d_1}{n} \right) \left( \frac{d_2}{n} \right)^{-1/2}$. Under simple assumptions on the eigenvalues of
$Q$ or equivalently of $\mathrm{Diag}(\sigma_i) K \mathrm{Diag}(\sigma_i)$, one can show that $\kappa_{\mathrm{var}}$ is small. For example, if $d$
of these eigenvalues are equal to one and the remaining ones are zero, then $\kappa_{\mathrm{var}} = \frac{R d^{1/2}}{\lambda^{1/2} n^{1/2}}$,
and thus we simply need $\lambda$ asymptotically greater than $R^2 d / n$. For additional conditions
for $\kappa_{\mathrm{var}}$, see [28, 30]. A simple condition for $\kappa_{\mathrm{bias}}$ can be obtained if $w_0^\top Q^{-1} w_0$ is assumed
bounded (in the context of RKHSs this is a stricter condition than the generating function being
inside the RKHS, and is used by [36] in the context of sparsity-inducing norms). In this case,
the bias terms are negligible compared to the variance term as soon as $\lambda$ is asymptotically
greater than $n^{-1/2}$.

Variance term. Note that the diagonal matrix $\mathrm{Diag}(\sigma_i^2)$ is upper-bounded by $\frac{1}{4} I$, i.e.,
$\mathrm{Diag}(\sigma_i^2) \preccurlyeq \frac{1}{4} I$, so that the degrees of freedom for logistic regression are always less than the
corresponding ones for least-squares regression (for $\lambda$ multiplied by 4). Indeed, the pairs $(x_i, y_i)$ for
which the conditional distribution is close to deterministic are such that $\sigma_i^2$ is close to zero.
This should reduce the variance of the estimator, as little noise is associated with these
points, and the effect of this reduction is exactly measured by the reduction in the degrees of
freedom.

Moreover, the rate of convergence $d_2/n$ of the variance term has been studied by many
authors (see, e.g., [22, 25, 30]) and depends on the decay of the eigenvalues of $Q$ (the faster
the decay, the smaller $d_2$). The degrees of freedom usually grow with $n$, but in many cases
more slowly than $n^{1/2}$, leading to faster rates in Eq. (14).



4.4 Smoothing parameter selection 

In this section, we obtain a criterion similar to Mallows' $C_L$ [37] to estimate the generalization
error and select in a data-driven way the regularization parameter $\lambda$ (referred to as the
smoothing parameter when dealing with splines or RKHSs). The following theorem shows
that with a data-dependent criterion, we may obtain a good estimate of the generalization
performance, up to a constant term $q^\top w_0$ independent of $\lambda$ (see proof in Appendix D):

Theorem 3 (Data-driven estimation of generalization performance) Assume (A1), (A2) and
(A3). Let $\hat{Q}_\lambda = \frac{1}{n} \sum_{i=1}^n \ell''(\hat{w}_\lambda^\top x_i) x_i x_i^\top$ and $q = \frac{1}{n} \sum_{i=1}^n (y_i/2 - \mathbb{E}(y_i/2)) x_i$. Assume moreover
$\kappa \leq 1/16$, where $\kappa$ is defined in Eq. (13). If $v \in [0, 1/4]$ satisfies $v^3 (d_2 + n b_2)^{1/2} \leq 12$,
then, with probability at least $1 - \exp(-v^2 (d_2 + n b_2))$:

$$\left| J_0(\hat{w}_\lambda) - \hat{J}_0(\hat{w}_\lambda) - \frac{1}{n} \mathrm{tr}\, \hat{Q}_\lambda (\hat{Q}_\lambda + \lambda I)^{-1} - q^\top w_0 \right| \leq \left( b_2 + \frac{d_2}{n} \right) (69 v + 2560 \kappa).$$

The previous theorem, which is essentially a non-asymptotic version of results in lOTl [32],
can be further extended to obtain oracle inequalities when minimizing the data-driven
criterion $\hat{J}_0(\hat{w}_\lambda) + \frac{1}{n} \mathrm{tr}\, \hat{Q}_\lambda (\hat{Q}_\lambda + \lambda I)^{-1}$, similar to results obtained in [35] HH for the square
loss. Note that contrary to least-squares regression with Gaussian noise, there is no need
to estimate the unknown noise variance (of course only when the logistic model is actually
well-specified); however, the matrix $Q$ used to define the degrees of freedom does depend on
$w_0$ and thus requires that $\hat{Q}_\lambda$ is used as an estimate. Finally, criteria based on generalized
cross-validation [38] IH could be studied with similar tools.
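A data-driven criterion of this kind, the empirical risk at the regularized estimator plus an estimated degrees of freedom divided by n, can be evaluated as in the sketch below. This is only an illustration under stated assumptions: the estimator is computed by Newton's method with a simple backtracking line search (a choice made here, not prescribed by the paper), the data are synthetic, and the function names (`fit_l2_logistic`, `data_driven_criterion`) are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def objective(w, X, y, lam):
    """Regularized empirical risk: mean logistic loss plus (lam / 2) * ||w||^2."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w))) + 0.5 * lam * (w @ w)

def fit_l2_logistic(X, y, lam, n_iter=50):
    """Minimize the regularized empirical risk by Newton's method with backtracking."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        s = sigmoid(X @ w)
        grad = X.T @ (s - 0.5 - y / 2.0) / n + lam * w
        hess = (X.T * (s * (1.0 - s))) @ X / n + lam * np.eye(p)
        step = np.linalg.solve(hess, -grad)
        if -(grad @ step) < 1e-14:   # squared Newton decrement is negligible
            break
        t, f0 = 1.0, objective(w, X, y, lam)
        # Halve the step until a sufficient decrease is observed.
        while objective(w + t * step, X, y, lam) > f0 + 0.25 * t * (grad @ step) and t > 1e-8:
            t *= 0.5
        w = w + t * step
    return w

def data_driven_criterion(X, y, lam):
    """Empirical risk at the estimator plus (1/n) tr Q_hat (Q_hat + lam I)^{-1}."""
    n, p = X.shape
    w_hat = fit_l2_logistic(X, y, lam)
    s = sigmoid(X @ w_hat)
    risk = np.mean(np.logaddexp(0.0, -y * (X @ w_hat)))
    Q_hat = (X.T * (s * (1.0 - s))) @ X / n
    dof = np.trace(Q_hat @ np.linalg.inv(Q_hat + lam * np.eye(p)))
    return risk + dof / n, w_hat

# Synthetic well-specified data (all values are assumptions for the example).
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.0, -1.0, 0.5])
y = np.where(rng.random(n) < sigmoid(X @ w_true), 1.0, -1.0)

lams = [0.01, 0.1, 1.0]
values = [data_driven_criterion(X, y, lam)[0] for lam in lams]
best_lam = lams[int(np.argmin(values))]
```

Selecting the regularization parameter then amounts to comparing the criterion over a grid of candidate values, as done in the last lines; the grid itself is arbitrary here.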

5 Regularization by the $\ell_1$-norm

In this section, we consider an estimator $\hat{w}_\lambda$ obtained as a minimizer of the $\ell_1$-regularized
empirical risk, i.e., $\hat{J}_0(w) + \lambda \|w\|_1$. It is well-known that the estimator has some zero
components [39]. In this section, we extend some of the recent results [12, 13, 14, 15, 16, 40]
for the square loss (i.e., the Lasso) to the logistic loss. We assume throughout this section
that the model is well-specified, that is, that the observations $y_i$, $i = 1, \dots, n$, are generated
according to the logistic model $\mathbb{P}(y_i = \varepsilon_i) = \sigma(\varepsilon_i w_0^\top x_i)$.

We denote by $K = \{ j \in \{1, \dots, p\}, (w_0)_j \neq 0 \}$ the set of non-zero components of $w_0$
and $s = \mathrm{sign}(w_0) \in \{-1, 0, 1\}^p$ the vector of signs of $w_0$. On top of Assumptions (A1), (A2)
and (A3), we will make the following assumption regarding normalization for each covariate
(which can always be imposed by renormalization), i.e.,

(A4) Normalized covariates: for all $j = 1, \dots, p$, $\frac{1}{n} \sum_{i=1}^n [(x_i)_j]^2 \leq 1$.

In this section, we consider two different results, one on model consistency (Section 5.1)
and one on efficiency (Section 5.2). As for the square loss, they will both depend on
additional assumptions regarding the square $p \times p$ matrix $Q = \frac{1}{n} \sum_{i=1}^n \ell''(w_0^\top x_i) x_i x_i^\top$. This
matrix is a weighted Gram matrix, which corresponds to the unweighted one for the square
loss. As already shown in [5, 3], usual assumptions for the Gram matrix for the square loss
are extended, for the logistic loss setting, using the weighted Gram matrix $Q$. In this paper,
we consider two types of results based on specific assumptions on $Q$, but other ones could be
considered as well (such as PTI ). The main contribution of using self-concordant analysis
is to allow simple extensions from the square loss with short proofs and sharper bounds, in
particular by avoiding an exponential constant in the maximal value of $|w_0^\top x_i|$, $i = 1, \dots, n$.

5.1 Model consistency condition 

The following theorem provides a sufficient condition for model consistency. It is based on
the consistency condition $\|Q_{K^c K} Q_{KK}^{-1} s_K\|_\infty < 1$, which is exactly the same as the one for
the square loss lITSl [12, 14] (see proof in Appendix E):

Theorem 4 (Model consistency for $\ell_1$-regularization) Assume (A1), (A2), (A3) and (A4).
Assume that there exists $\eta, \rho, \mu > 0$ such that

$$\|Q_{K^c K} Q_{KK}^{-1} s_K\|_\infty \leq 1 - \eta, \tag{15}$$

$\lambda_{\min}(Q_{KK}) \geq \rho$ and $\min_{j \in K} |(w_0)_j| \geq \mu$. Assume X ^ min |^|^^, g|^p^|. Then the
probability that the vector of signs of $\hat{w}_\lambda$ is different from $s = \mathrm{sign}(w_0)$ is upperbounded by

.,e.p ( - ^) . .1^1 e.p ( - . ^I-I e.P ( - ^Q) ■ (.« 

Comparison with square loss. For the square loss, the previous theorem simplifies |[T5l 

3/2 

[T2I : with our notations, the constraint A ^ q^^|^^,| and the last term in Eq. (fT6] ). which are the 
only ones depending on $R$, can be removed (indeed, the square loss allows the application
of our adapted self-concordant analysis with the constant $R = 0$). On the one hand, the
favorable scaling between $p$ and $n$, i.e., $\log p = O(n)$ for a certain well-chosen $\lambda$, is preserved
(since the logarithm of the added term is proportional to $-\lambda n$). However, on the other hand,
the terms in $R$ may be large as $R$ is the radius of the entire data (i.e., with all $p$ covariates).
Bounds with the radius of the data on only the relevant features in $K$ could be derived as well
(see details in the proof in Appendix E).

Necessary condition. In the case of the square loss, a weak form of Eq. (15), i.e.,
$\|Q_{K^c K} Q_{KK}^{-1} s_K\|_\infty \leq 1$, turns out to be necessary and sufficient for asymptotic correct model selection [14]. While
the weak form is clearly necessary for model consistency, and the strict form sufficient (as
proved in Theorem 4), we are currently investigating whether the weak condition is also
sufficient for the logistic loss.
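The quantities entering the consistency condition can be computed from the weighted Gram matrix once the support and signs of the generating vector are fixed. The sketch below evaluates the sup-norm statistic of the condition, the smallest eigenvalue of the block indexed by the support, and the smallest non-zero coefficient in absolute value. The design, the sparse vector and the variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 6
X = rng.normal(size=(n, p))                        # hypothetical design
w0 = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0])     # hypothetical sparse generating vector

signs = np.sign(w0)
K = np.flatnonzero(w0)          # indices of non-zero components
Kc = np.flatnonzero(w0 == 0)    # complement of K

# Weighted Gram matrix Q = (1/n) sum_i sigma_i (1 - sigma_i) x_i x_i^T, sigma_i = sigma(w0^T x_i).
sig = 1.0 / (1.0 + np.exp(-(X @ w0)))
Q = (X.T * (sig * (1.0 - sig))) @ X / n

Q_KK = Q[np.ix_(K, K)]
Q_KcK = Q[np.ix_(Kc, K)]

# Sup-norm of Q_{K^c K} Q_{KK}^{-1} s_K; the strict condition asks for a value below one.
stat = np.max(np.abs(Q_KcK @ np.linalg.solve(Q_KK, signs[K])))
eta = 1.0 - stat                        # margin in the strict condition, if positive
rho = np.linalg.eigvalsh(Q_KK)[0]       # smallest eigenvalue of Q_KK
mu = np.min(np.abs(w0[K]))              # smallest non-zero coefficient in absolute value
```

Compared with the square loss, the only change is the weighting of the Gram matrix by the conditional variances; the unweighted version is obtained by replacing the weights with a constant.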



5.2 Efficiency 

Another type of result has been derived, based on different proof techniques |[T6l and aimed 
at efficiency (i.e., predictive performance). Here again, we can extend the result in a very 
simple way. We assume, given K the set of non-zero components of wq: 

(AS) Restricted eigenvalue condition: 

p = mill — — — > 0. 

||Akc||i^3||Ak||i 2 



Note that the assumption made in [16] is slightly stronger but only depends on the
cardinality of $K$ (by minimizing with respect to all sets of indices with cardinality equal to the
one of $K$). The following theorem provides an estimate of the estimation error as well as an
oracle inequality for the generalization performance (see proof in Appendix F):

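Theorem 5 (Efficiency for $\ell_1$-regularization) Assume (A1), (A2), (A3), (A4), and (A5).
For all $\lambda \leqslant \frac{\rho^2}{48R|K|}$, with probability at least $1 - 2pe^{-n\lambda^2/5}$, we have:

$$\|\hat{w}_\lambda - w_0\|_1 \leqslant 12\lambda|K|\rho^{-2},$$
$$J_0(\hat{w}_\lambda) - J_0(w_0) \leqslant 12\lambda^2|K|\rho^{-2}.$$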
We obtain a result which directly mimics the one obtained in [16] for the square loss, with
the exception of the added bound on $\lambda$. In particular, if we take $\lambda = \sqrt{10\log(p)/n}$, we
get with probability at least $1 - 2/p$ an upper bound on the generalization performance
$J_0(\hat{w}_\lambda) \leqslant J_0(w_0) + 120\frac{\log p}{n}|K|\rho^{-2}$. Again, the proof of this result is a direct extension
of the corresponding one for the square loss, with few additional assumptions owing to the
proper self-concordant analysis.
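The arithmetic behind this choice of regularization parameter can be checked numerically. The sketch below assumes the constants of Theorem 5 read as $\lambda \leqslant \rho^2/(48R|K|)$, failure probability $2pe^{-n\lambda^2/5}$ and bounds $12\lambda|K|\rho^{-2}$ and $12\lambda^2|K|\rho^{-2}$; the function name `theorem5_quantities` and the numerical values of $n$, $p$, $|K|$, $\rho$ and $R$ are hypothetical.

```python
import math


def theorem5_quantities(n, p, k, rho, R):
    """Evaluate the quantities of Theorem 5 for lambda = sqrt(10 log(p) / n).

    n: sample size, p: dimension, k: |K|, rho: restricted eigenvalue
    constant, R: radius of the data. Constants are assumed as stated
    in the lead-in paragraph.
    """
    lam = math.sqrt(10.0 * math.log(p) / n)
    lam_max = rho ** 2 / (48.0 * R * k)          # admissible range for lambda
    failure_prob = 2.0 * p * math.exp(-n * lam ** 2 / 5.0)
    return {
        "lambda": lam,
        "admissible": lam <= lam_max,
        "failure_prob": failure_prob,            # simplifies to 2 / p
        "l1_error": 12.0 * lam * k / rho ** 2,
        "excess_risk": 12.0 * lam ** 2 * k / rho ** 2,
    }


# Hypothetical problem sizes, for illustration only.
out = theorem5_quantities(n=10 ** 8, p=1000, k=5, rho=0.5, R=1.0)
for name, value in out.items():
    print(name, value)
```

The example also makes the role of the added constraint visible: for a fixed $p$, the choice $\lambda = \sqrt{10\log(p)/n}$ only becomes admissible once $n$ is large enough relative to $R|K|\rho^{-2}$.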

6 Conclusion 

We have provided an extension of self-concordant functions that allows simple extensions
of theoretical results for the square loss to the logistic loss. We have applied the extension
techniques to regularization by the $\ell_2$-norm and regularization by the $\ell_1$-norm, showing that
new results for logistic regression can be easily derived from corresponding results for
least-squares regression, without added complex assumptions.

The present work could be extended in several interesting ways to different settings.
First, for logistic regression, other extensions of theoretical results from least-squares
regression could be carried out: for example, the analysis of sequential experimental design for
logistic regression leads to many assumptions that could be relaxed (see, e.g., [42]). Also,
other regularization frameworks based on sparsity-inducing norms could be applied to
logistic regression with guarantees similar to those for least-squares regression, such as group
Lasso for grouped variables [43] or non-parametric problems [36], or resampling-based
procedures [44, 45] that make it possible to dispense with sufficient consistency conditions.

Second, the techniques developed in this paper could be extended to other M-estimation
problems: indeed, other generalized linear models beyond logistic regression could be
considered where higher-order derivatives can be expressed through cumulants iT^. Moreover,
similar developments could be made for density estimation for the exponential family, which
would in particular lead to interesting developments for Gaussian models in high dimensions,
where $\ell_1$-regularization has proved useful [46, 47]. Finally, other losses for binary or
multi-class classification are of clear interest [21], potentially with different controls of the third
derivatives.

A Proofs of optimization results 

We follow the proof techniques of HI, by simply changing the control of the third order 
derivative. We denote by $F'''(w)$ the third-order derivative of $F$, which is itself a function
from $\mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p$ to $\mathbb{R}$. The assumptions made in Propositions 1 and 2 are in fact equivalent
to (see similar proof in fSl): 

$$\forall u, v, t, w \in \mathbb{R}^p, \quad |F'''(w)[u, v, t]| \leqslant R\|u\|_2 [v^\top F''(w) v]^{1/2} [t^\top F''(w) t]^{1/2}. \tag{17}$$
A.1 Univariate functions

We first consider univariate functions and prove the following lemma that gives upper and
lower Taylor expansions:

Lemma 1 Let $g$ be a convex three times differentiable function $g : \mathbb{R} \to \mathbb{R}$ such that for all
$t \in \mathbb{R}$, $|g'''(t)| \leqslant S g''(t)$, for some $S \geqslant 0$. Then, for all $t \geqslant 0$:

$$\frac{g''(0)}{S^2}(e^{-St} + St - 1) \leqslant g(t) - g(0) - g'(0)t \leqslant \frac{g''(0)}{S^2}(e^{St} - St - 1). \tag{18}$$

Proof Let us first assume that $g''(t)$ is strictly positive for all $t \in \mathbb{R}$. We have, for all $t \geqslant 0$:
$-S \leqslant \frac{d \log g''(t)}{dt} \leqslant S$. Then, by integrating once between $0$ and $t$, taking exponentials, and
then integrating twice:

$$-St \leqslant \log g''(t) - \log g''(0) \leqslant St,$$
$$g''(0)e^{-St} \leqslant g''(t) \leqslant g''(0)e^{St}, \tag{19}$$
$$g''(0)S^{-1}(1 - e^{-St}) \leqslant g'(t) - g'(0) \leqslant g''(0)S^{-1}(e^{St} - 1),$$
$$g(t) \geqslant g(0) + g'(0)t + g''(0)S^{-2}(e^{-St} + St - 1), \tag{20}$$
$$g(t) \leqslant g(0) + g'(0)t + g''(0)S^{-2}(e^{St} - St - 1), \tag{21}$$

which leads to Eq. (18).

Let us now assume only that $g''(0) > 0$. If we denote by $A$ the connected component that
contains $0$ of the open set $\{t \in \mathbb{R}, g''(t) > 0\}$, then the preceding developments are valid on
$A$; thus, Eq. (19) implies that $A$ is not upper-bounded. The same reasoning on $t \mapsto g(-t)$ ensures
that $A = \mathbb{R}$ and hence $g''(t)$ is strictly positive for all $t \in \mathbb{R}$. Since the problem is invariant
by translation, we have shown that if there exists $t_0 \in \mathbb{R}$ such that $g''(t_0) > 0$, then for all
$t \in \mathbb{R}$, $g''(t) > 0$.

Thus, we need to prove Eq. (18) for $g''$ always strictly positive (which is done above) and
for $g''$ identically equal to zero, which implies that $g$ is linear, in which case Eq. (18) holds
trivially (all three terms are zero).



Note the difference with a classical uniform bound on the third derivative, which leads to a
third-order polynomial lower bound, which tends to $-\infty$ more quickly than Eq. (20).
Moreover, Eq. (21) may be interpreted as an upper bound on the remainder in the Taylor expansion
of $g$ around $0$:

$$g(t) - g(0) - g'(0)t - \frac{g''(0)}{2}t^2 \leqslant g''(0)S^{-2}\big(e^{St} - \tfrac{1}{2}S^2t^2 - St - 1\big).$$

The right-hand side is equivalent to $\frac{St^3}{6}g''(0)$ for $t$ close to zero (which should be expected
from a three-times differentiable function such that $g'''(0) \leqslant S g''(0)$), but still provides a
good bound for $t$ away from zero (which cannot be obtained from a regular Taylor expansion).
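The two-sided bound of Lemma 1 can be checked numerically on the logistic loss $g(t) = \log(1 + e^{-t})$, for which $g'' = \sigma(1 - \sigma)$ and $g''' = \sigma(1 - \sigma)(1 - 2\sigma)$ with $\sigma$ the sigmoid function, so that $|g'''| \leqslant g''$ and the lemma applies with $S = 1$. The following is a minimal sketch; the function names and the grid of values of $t$ are illustrative choices, not part of the original analysis.

```python
import math


def g(t):
    """Logistic loss g(t) = log(1 + exp(-t))."""
    return math.log1p(math.exp(-t))


def g1(t):
    """First derivative g'(t) = -1 / (1 + exp(t))."""
    return -1.0 / (1.0 + math.exp(t))


def g2(t):
    """Second derivative g''(t) = sigma(t) (1 - sigma(t))."""
    s = 1.0 / (1.0 + math.exp(-t))
    return s * (1.0 - s)


def lemma1_terms(t, S=1.0):
    """Return (lower, middle, upper) from the two-sided bound of Lemma 1."""
    lower = g2(0.0) / S ** 2 * (math.exp(-S * t) + S * t - 1.0)
    middle = g(t) - g(0.0) - g1(0.0) * t
    upper = g2(0.0) / S ** 2 * (math.exp(S * t) - S * t - 1.0)
    return lower, middle, upper


# Check the inequalities on a grid of t in [0, 10].
grid = [0.1 * k for k in range(101)]
results = [lemma1_terms(t) for t in grid]
all_ok = all(lo <= mid + 1e-12 and mid <= up + 1e-12 for lo, mid, up in results)
print("Lemma 1 bounds hold on the grid:", all_ok)
```

On this example the lower bound grows linearly in $t$ while the upper bound grows exponentially, which is the behaviour that distinguishes these bounds from a third-order polynomial remainder.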



Throughout the proofs, we will use the fact that the functions $u \mapsto \frac{e^u - 1}{u}$ and $u \mapsto \frac{e^u - 1 - u}{u^2}$
can be extended to continuous functions on $\mathbb{R}$, which are thus bounded on any compact. The
bound will depend on the compact and can be obtained easily.
A.2 Proof of Proposition 1

By applying Lemma 1 (Eq. (20) and Eq. (21)) to $g(t) = F(w + tv)$ (with constant $S =
R\|v\|_2$) and taking $t = 1$, we get the desired first two inequalities in Eq. (3) and Eq. (4).
By considering the function $g(t) = u^\top F''(w + tv)u$, we have $g'(t) = F'''(w + tv)[u, u, v]$,
which is such that $|g'(t)| \leqslant \|v\|_2 R g(t)$, leading to $g(0)e^{-\|v\|_2 Rt} \leqslant g(t) \leqslant g(0)e^{\|v\|_2 Rt}$, and
thus to Eq. (6) for $t = 1$ (when considered for all $u \in \mathbb{R}^p$).
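The resulting control of the Hessian, $e^{-R\|v\|_2} u^\top F''(w) u \leqslant u^\top F''(w + v) u \leqslant e^{R\|v\|_2} u^\top F''(w) u$, can be illustrated on an unregularized logistic objective, whose Hessian $\frac{1}{n}\sum_i \sigma'(x_i^\top w) x_i x_i^\top$ does not depend on the labels. The sketch below is only an illustration under assumed data: the design points, the test directions and the helper names are hypothetical.

```python
import math

# Hypothetical design points x_i in R^2 (illustrative values only).
X = [(1.0, 0.5), (-0.5, 1.5), (2.0, -1.0), (0.3, 0.3)]
R = max(math.hypot(a, b) for a, b in X)  # radius of the data


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def hessian_quadratic_form(w, u):
    """u^T F''(w) u for the logistic objective F(w) = (1/n) sum_i l(x_i^T w)."""
    total = 0.0
    for x in X:
        s = sigmoid(x[0] * w[0] + x[1] * w[1])
        total += s * (1.0 - s) * (x[0] * u[0] + x[1] * u[1]) ** 2
    return total / len(X)


def sandwich_holds(w, v, u):
    """Check the two-sided Hessian bound between w and w + v along direction u."""
    base = hessian_quadratic_form(w, u)
    moved = hessian_quadratic_form((w[0] + v[0], w[1] + v[1]), u)
    scale = math.exp(R * math.hypot(v[0], v[1]))
    return base / scale <= moved + 1e-12 and moved <= base * scale + 1e-12


w = (0.2, -0.4)
checks = [
    sandwich_holds(w, v, u)
    for v in [(0.5, 0.0), (-1.0, 2.0), (0.1, -0.3)]
    for u in [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0)]
]
print("Hessian bounds hold for all tested (v, u):", all(checks))
```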

In order to prove Eq. (5), we consider $h(t) = z^\top (F'(w + tv) - F'(w) - F''(w)vt)$. We
have $h(0) = 0$, $h'(0) = 0$ and

$$h''(t) = F'''(w + tv)[v, v, z] \leqslant R\|v\|_2 e^{tR\|v\|_2} [z^\top F''(w) z]^{1/2} [v^\top F''(w) v]^{1/2},$$

using Eq. (6) and Eq. (17). Thus, by integrating between $0$ and $t$,

$$h'(t) \leqslant [z^\top F''(w) z]^{1/2} [v^\top F''(w) v]^{1/2} (e^{tR\|v\|_2} - 1),$$

which implies $h(1) \leqslant [z^\top F''(w) z]^{1/2} [v^\top F''(w) v]^{1/2} \int_0^1 (e^{tR\|v\|_2} - 1)\,dt$, which in turn leads
to Eq. (5).

Using similar techniques, i.e., by considering the function $t \mapsto z^\top [F''(w + tv) -
F''(w)]u$, we can prove that for all $z, u, v, w \in \mathbb{R}^p$, we have:

$$z^\top [F''(w + v) - F''(w)]u \leqslant \frac{e^{R\|v\|_2} - 1}{\|v\|_2} [v^\top F''(w) v]^{1/2} [z^\top F''(w) z]^{1/2} \|u\|_2. \tag{22}$$

A.3 Proof of Proposition 2

Since we have assumed that $\lambda(w) > 0$, then by Eq. (6) the Hessian of $F$ is everywhere
invertible, and hence the function $F$ is strictly convex. Therefore, if the minimum is attained,
it is unique.

Let $v \in \mathbb{R}^p$ be such that $v^\top F''(w) v = 1$. Without loss of generality, we may assume
that $F'(w)^\top v$ is negative. This implies that for all $t \leqslant 0$, $F(w + tv) \geqslant F(w)$. Moreover,
let us denote $\kappa = -v^\top F'(w) R\|v\|_2$, which is nonnegative and such that
$\kappa \leqslant \frac{R|v^\top F'(w)|}{\lambda(w)^{1/2}} \leqslant \frac{R\nu(F, w)}{\lambda(w)^{1/2}} \leqslant 1/2$. From Eq. (3), for all $t \geqslant 0$, we have:

$$F(w + tv) \geqslant F(w) + v^\top F'(w)t + \frac{1}{R^2\|v\|_2^2}\big(e^{-R\|v\|_2 t} + R\|v\|_2 t - 1\big)$$
$$= F(w) + \frac{1}{R^2\|v\|_2^2}\big[e^{-R\|v\|_2 t} + (1 - \kappa)R\|v\|_2 t - 1\big].$$

Moreover, a short calculation shows that for all $\kappa \in (0, 1)$:

$$e^{-2\kappa(1-\kappa)^{-1}} + (1 - \kappa)2\kappa(1 - \kappa)^{-1} - 1 \geqslant 0. \tag{23}$$

This implies that for $t_0 = 2(R\|v\|_2)^{-1}\kappa(1 - \kappa)^{-1}$, $F(w + t_0 v) \geqslant F(w)$. Since
$t_0 \leqslant \frac{2}{1-\kappa}|v^\top F'(w)| \leqslant 2\nu(F, w)\big(1 - \frac{\nu(F,w)R}{\lambda(w)^{1/2}}\big)^{-1} \leqslant 4\nu(F, w)$, we have $F(w + tv) \geqslant F(w)$ for
$t = 4\nu(F, w)$.

Since this is true for all $v$ such that $v^\top F''(w) v = 1$, this shows that the value of the
function $F$ on the entire ellipsoid (since $F''(w)$ is positive definite) $v^\top F''(w) v = 16\nu(F, w)^2$
is greater or equal to the value at $w$; thus, by convexity, there must be a minimizer $w^*$ of $F$
(which is unique because of Eq. (6)) such that

$$(w - w^*)^\top F''(w)(w - w^*) \leqslant 16\nu(F, w)^2,$$

leading to Eq. (7).

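In order to prove Eq. (9), we will simply apply Eq. (7) at $w + v$, which requires to
upper-bound $\nu(F, w + v)$. If we denote by $v = -F''(w)^{-1}F'(w)$ the Newton step, we have:

$$\|F''(w)^{-1/2}F'(w + v)\|_2 = \|F''(w)^{-1/2}[F'(w + v) - F'(w) - F''(w)v]\|_2$$
$$= \Big\|\int_0^1 F''(w)^{-1/2}[F''(w + tv) - F''(w)]v\,dt\Big\|_2$$
$$\leqslant \int_0^1 \big\|F''(w)^{-1/2}[F''(w + tv) - F''(w)]F''(w)^{-1/2}F''(w)^{1/2}v\big\|_2\,dt$$
$$\leqslant \int_0^1 \big\|F''(w)^{-1/2}F''(w + tv)F''(w)^{-1/2} - I\big\|_{\mathrm{op}} \|F''(w)^{1/2}v\|_2\,dt.$$

Moreover, we have from Eq. (6):

$$(e^{-tR\|v\|_2} - 1)I \preccurlyeq F''(w)^{-1/2}F''(w + tv)F''(w)^{-1/2} - I \preccurlyeq (e^{tR\|v\|_2} - 1)I.$$

Thus,

$$\|F''(w)^{-1/2}F'(w + v)\|_2 \leqslant \int_0^1 \max\{e^{tR\|v\|_2} - 1, 1 - e^{-tR\|v\|_2}\}\|F''(w)^{1/2}v\|_2\,dt$$
$$= \nu(F, w)\int_0^1 (e^{tR\|v\|_2} - 1)\,dt = \nu(F, w)\frac{e^{R\|v\|_2} - 1 - R\|v\|_2}{R\|v\|_2}.$$

Therefore, using Eq. (6) again, we obtain:

$$\nu(F, w + v) = \|F''(w + v)^{-1/2}F'(w + v)\|_2 \leqslant \nu(F, w)e^{R\|v\|_2/2}\frac{e^{R\|v\|_2} - 1 - R\|v\|_2}{R\|v\|_2}.$$

We have $R\|v\|_2 \leqslant R\lambda(w)^{-1/2}\nu(F, w) \leqslant 1/2$, and thus, we have

$$e^{R\|v\|_2/2}\frac{e^{R\|v\|_2} - 1 - R\|v\|_2}{R\|v\|_2} \leqslant R\|v\|_2 \leqslant R\nu(F, w)\lambda(w)^{-1/2},$$

leading to:

$$\nu(F, w + v) \leqslant \frac{R}{\lambda(w)^{1/2}}\nu(F, w)^2. \tag{24}$$

Moreover, we have:

$$\frac{R\nu(F, w + v)}{\lambda(w + v)^{1/2}} \leqslant \frac{Re^{R\|v\|_2/2}}{\lambda(w)^{1/2}}\nu(F, w + v) \leqslant \frac{R}{\lambda(w)^{1/2}}\nu(F, w)e^{R\|v\|_2}\frac{e^{R\|v\|_2} - 1 - R\|v\|_2}{R\|v\|_2} \leqslant \Big(\frac{R\nu(F, w)}{\lambda(w)^{1/2}}\Big)^2,$$

which leads to Eq. (8). Moreover, it shows that we can apply Eq. (7) at $w + v$ and get:

$$[(w^* - w - v)^\top F''(w)(w^* - w - v)]^{1/2} \leqslant 4e^{R\|v\|_2/2}\nu(F, w + v) \leqslant 4R\|v\|_2\nu(F, w),$$

which leads to the desired result, i.e., Eq. (9).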
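The behaviour of Newton's method described by Proposition 2 can be observed on a one-dimensional regularized logistic regression problem. In one dimension, $\lambda(w) = F''(w)$ and the scaled decrement $R\nu(F, w)/\lambda(w)^{1/2}$ reduces to $R|F'(w)|/F''(w)$. The sketch below checks that this quantity is squared (at least) after one Newton step when it starts below $1/2$, and that the minimizer lies in the ellipsoid $(w - w^*)^2 F''(w) \leqslant 16\nu(F, w)^2$. The data, the regularization parameter and the function names are hypothetical and only serve the illustration.

```python
import math

# Hypothetical 1-D data with labels in {-1, +1} (illustrative values only).
XS = [1.0, -1.0, 0.5, -0.5]
YS = [1.0, -1.0, 1.0, 1.0]
LAM = 0.5                      # assumed ridge parameter
R = max(abs(x) for x in XS)    # radius of the data


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def grad(w):
    """F'(w) for F(w) = (1/n) sum_i log(1 + exp(-y_i x_i w)) + (LAM / 2) w^2."""
    g = sum(-y * x * sigmoid(-y * x * w) for x, y in zip(XS, YS)) / len(XS)
    return g + LAM * w


def hess(w):
    """F''(w); the logistic part does not depend on the labels."""
    total = 0.0
    for x in XS:
        s = sigmoid(x * w)
        total += x * x * s * (1.0 - s)
    return total / len(XS) + LAM


def scaled_decrement(w):
    """R nu(F, w) / lambda(w)^{1/2}, which equals R |F'(w)| / F''(w) in 1-D."""
    return R * abs(grad(w)) / hess(w)


def newton_trace(w, steps=4):
    """Run pure Newton steps and record the scaled decrement at each iterate."""
    trace = [scaled_decrement(w)]
    for _ in range(steps):
        w = w - grad(w) / hess(w)
        trace.append(scaled_decrement(w))
    return w, trace


w_start = 0.0
w_star, trace = newton_trace(w_start)
nu_sq = grad(w_start) ** 2 / hess(w_start)          # nu(F, w)^2 at the start
ellipsoid_lhs = (w_start - w_star) ** 2 * hess(w_start)
print("scaled decrements:", trace[:3])
print("ellipsoid check:", ellipsoid_lhs, "<=", 16.0 * nu_sq)
```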

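B Proof of Theorem 1

Following [26, 27], we denote by $w_\lambda$ the unique global minimizer of the expected regularized
risk $J_\lambda(w) = J_0(w) + \frac{\lambda}{2}\|w\|_2^2$. We simply apply Eq. (7) from Proposition 2 to $\hat{J}_\lambda$ and $w_\lambda$,
to obtain, if the Newton decrement (see Section 2 for its definition) $\nu(\hat{J}_\lambda, w_\lambda)^2$ is less than
$\lambda/4R^2$, that $\hat{w}_\lambda$ and its population counterpart $w_\lambda$ are close, i.e.:

$$(\hat{w}_\lambda - w_\lambda)^\top \hat{J}_\lambda''(w_\lambda)(\hat{w}_\lambda - w_\lambda) \leqslant 16\nu(\hat{J}_\lambda, w_\lambda)^2.$$

We can then apply the upper Taylor expansion in Eq. (4) from Proposition 1 to $J_\lambda$ and $w_\lambda$, to
obtain, with $v = \hat{w}_\lambda - w_\lambda$ (which is such that $R\|v\|_2 \leqslant 4R\nu(\hat{J}_\lambda, w_\lambda)\lambda^{-1/2} \leqslant 2$):

$$J_\lambda(\hat{w}_\lambda) - J_\lambda(w_\lambda) \leqslant \frac{v^\top J_\lambda''(w_\lambda)v}{R^2\|v\|_2^2}\big(e^{R\|v\|_2} - R\|v\|_2 - 1\big) \leqslant 20\nu(\hat{J}_\lambda, w_\lambda)^2.$$

Therefore, for any $w_0 \in \mathbb{R}^p$, since $w_\lambda$ is the minimizer of $J_\lambda(w) = J_0(w) + \frac{\lambda}{2}\|w\|_2^2$:

$$J_0(\hat{w}_\lambda) \leqslant J_0(w_0) + \frac{\lambda}{2}\|w_0\|_2^2 + 20\nu(\hat{J}_\lambda, w_\lambda)^2. \tag{25}$$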

We can now apply the concentration inequaUty from Proposition |4] in Appendix iGl i.e., 

Eq. (ill), with u = log(8/5). We use A = i9i?2^i£sMl. in order to actually have 

v{Jx,wx) ^ A^/2/2i? (so that we can apply our self-concordant analysis), it is sufficient 
that: 

AlR^u/\n ^ \/m^, 63(^/^)3/2^2/^ ^ A/16i^^ S{u/nfR^/X ^ A/16i?^ 
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leading to the constraints u ^ n/125. We then get with probability at least 1 — 5 = 1 — 8e " 
(for n ^ n/125): 

jl-^^j, 1,2 ^ ^ , (10 + 100i?^|ko||i)Vlog(8A) 



For u > n/12b, the bound in Eq. ([T2l ) is always satisfied. Indeed, this implies with our choice 
of A that A ^ R^. Moreover, since HfWAlll bounded from above by log(2)A~^ ^ R~^, 

Jg{w\) ^ Jg{wq) + ^11% - wqW'f ^ Joi'wo) + 1 + i?^||w;o|li, 
which is smaller than the right hand-side of Eq. ([T2l) . 



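C Proof of Theorem 2

We denote by $J_0^T$ the second-order Taylor expansion of $J_0$ around $w_0$, equal to $J_0^T(w) =
J_0(w_0) + \frac{1}{2}(w - w_0)^\top Q(w - w_0)$, with $Q = J_0''(w_0)$, and $\hat{J}_0^T$ the expansion of $\hat{J}_0$ around
$w_0$, equal to $J_0^T(w) - q^\top w$. We denote by $\hat{w}_\lambda^N$ the one-step Newton iterate from $w_0$ for the
function $\hat{J}_\lambda$, defined as the global minimizer of $\hat{J}_0^T(w) + \frac{\lambda}{2}\|w\|_2^2$ and equal to
$\hat{w}_\lambda^N = w_0 + (Q + \lambda I)^{-1}(q - \lambda w_0)$.

What the following proposition shows is that we can replace $\hat{J}_0$ by $\hat{J}_0^T$ for obtaining the
estimator and that we can replace $J_0$ by $J_0^T$ for measuring its performance, i.e., we may do
as if we had a weighted least-squares cost, as long as the Newton decrement is small enough:

Proposition 3 (Quadratic approximation of risks) Assume $\nu(\hat{J}_\lambda, w_0)^2 = (q - \lambda w_0)^\top (Q +
\lambda I)^{-1}(q - \lambda w_0) \leqslant \frac{\lambda}{4R^2}$. We have:

$$|J_0(\hat{w}_\lambda) - J_0^T(\hat{w}_\lambda^N)| \leqslant \frac{15R\nu(\hat{J}_\lambda, w_0)^2}{\lambda^{1/2}}\|Q^{1/2}(\hat{w}_\lambda^N - w_0)\|_2 + \frac{40R^2}{\lambda}\nu(\hat{J}_\lambda, w_0)^4. \tag{26}$$

Proof We show that (1) $\hat{w}_\lambda^N$ is close to $\hat{w}_\lambda$ using Proposition 2 on the behavior of Newton's
method, (2) that $\hat{w}_\lambda^N$ is close to $w_0$ by using its closed form $\hat{w}_\lambda^N = w_0 + (Q + \lambda I)^{-1}(q - \lambda w_0)$,
and (3) that $J_0$ and $J_0^T$ are close using Proposition 1 on upper and lower Taylor expansions.
We first apply Eq. (9) from Proposition 2 to get

$$(\hat{w}_\lambda - \hat{w}_\lambda^N)^\top \hat{J}_\lambda''(w_0)(\hat{w}_\lambda - \hat{w}_\lambda^N) \leqslant \frac{16R^2}{\lambda}\nu(\hat{J}_\lambda, w_0)^4. \tag{27}$$

This implies that $\hat{w}_\lambda$ and $\hat{w}_\lambda^N$ are close, i.e.,

$$\|\hat{w}_\lambda - \hat{w}_\lambda^N\|_2^2 \leqslant \lambda^{-1}(\hat{w}_\lambda - \hat{w}_\lambda^N)^\top \hat{J}_\lambda''(w_0)(\hat{w}_\lambda - \hat{w}_\lambda^N) \leqslant \frac{16R^2}{\lambda^2}\nu(\hat{J}_\lambda, w_0)^4 \leqslant \frac{4}{\lambda}\nu(\hat{J}_\lambda, w_0)^2 \leqslant \frac{1}{R^2}.$$

Thus, using the closed form expression for $\hat{w}_\lambda^N = w_0 + (Q + \lambda I)^{-1}(q - \lambda w_0)$, we obtain

$$\|\hat{w}_\lambda - w_0\|_2 \leqslant \|\hat{w}_\lambda - \hat{w}_\lambda^N\|_2 + \|w_0 - \hat{w}_\lambda^N\|_2 \leqslant \frac{2\nu(\hat{J}_\lambda, w_0)}{\lambda^{1/2}} + \frac{\nu(\hat{J}_\lambda, w_0)}{\lambda^{1/2}} \leqslant \frac{3\nu(\hat{J}_\lambda, w_0)}{\lambda^{1/2}} \leqslant \frac{3}{2R}.$$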

We can now apply Eq. ^ from Proposition |2] to get for all v such that -R||w||2 ^ 3/2, 

IMwo + v)- 4 (wo + v)\^ {v'^Qv)R\\v\\2/i. (28) 

Thus, using Eq. ( |28l ) for v = wx — wq and v = — wq : 

\Mwx)-4{w^)\ 
^ \Mwx) - 4{wx)\ + \4{w^) - 4{wx)\, 

^ j\\wx- woh \\Q'^\wx -wo)\\l + l\ "^'^'^"^^ " "^^^"^ - IIQ'/'« - ^o)||? 

^ ''^ltr^ llQ'^^(^A-^0)||i + ^ IIQ^/^(^A-^0)||i-||Q^/^«--0)||i 



3i?i/(JA,tt;o) 



||Qi/2« - u;o)i + (i + l) ||Q'/'(^A - wo)\\l - \\Q'/\w^ - wo)\\ 



4A1/2 

3i?KiA,U;o)||^V2(^^^_^„)||2 



4A1/2 

+ ^IIQ^/^(^A - Olli + ^||g^/^(.iA - Ol|2||Q^/^« - «'0)||2. 

FromEq. we have \\Q^^'^{wx - w^)\\l ^ ^^i^{Jx,wo)'^. We thus obtain, using 
that||QV2(^7V_^^)||2 ^^(j^^^^). 

|Jo(^a) - Jo^«)l^(^ + ^V32) ^^^^^||Q^/^« - ..0)112 + ^KiA,-o)^ 
which leads to the desired result. ■ 



We can now go on with the proof of Theorem 2. From Eq. (26) in Proposition 3 above,
we have, if $\nu(\hat{J}_\lambda, w_0)^2 \leqslant \lambda/4R^2$,

Mwx) = 4{w^) + B 

= Joiwo) + \iQ- >^wofQ{Q + >^irHq - >^wo) + B 

= Uwo) + ^ + ^i + B + C, 
In 2 

withC = Xw1{Q + XI)-'^Qq + ]^iT{Q + XI)-^Q{qq'^ -^Q^ , 

1^1 HQ ^ (W'A -^^0)||2 + ^^l^(A,W0) • 
We can now bound each term separately and check that we indeed have $\nu(\hat{J}_\lambda, w_0)^2 \leqslant \lambda/4R^2$
(which allows us to apply Proposition 2). First, from Eq. (13), we can derive

, , ^2 , , , di kA1/2 ^2 n 1/2 . /«A1/2 di.l/2 

&2 H ^ oi H ^ -^-[b2 H ) ^ —5-\0i H ) 1 

n n K n ' K n 

which implies the following identities: 

do kP'X 
h2 + — i^hi + — (29) 

We have moreover: 

iy{Jx,WQf = {q - Xwo)^ {Q + XI)'\q - Xwo) 

^ 6l + — + tr (Q + A/)-l (qq'^ - -] + 2XwJ {Q + XI)-^q. 
n \ n J 

We can now apply concentration inequalities from Appendix |Gl together with the following 
applications of Bernstein's inequahty. Indeed, we have Xw^ {Q + XI)~^Qq = Y^^=i with 

\Zi\ ^ -\v4{Q + XI)~'^Qxi\ 
n 

A / X , „ , 9 „ \ V2 / , „ , , „ X 1/2 ^1/2 



Moreover, EZ^ ^ + XI)~'^Q'^{Q + Xiy^wo ^ ^62- We can now apply Bernstein 

inequality [2] to get with probability at least 1 — 2e~" (and using Eq. (|29l)): 



, T/^ , 9^ /262^i 1i,l/2^, 1/9 ^boU UK 

xwjiQ + A/ r'Q'? ^ + — 6r^A-i/2 ^ + 

V n on V ?^ on 

Similarly, with probability at least 1 — 2e~", we have: 



^ T ^ 1 / 269n UK 

We thus get, through the union bound, with probability at least 1 — 20e~" : 

^ + — + n3/2Ai/2 +^1^ 



77, on / 



dl 647X^/2^^ , ^2^1/2 , , ^ i?2 9^2 

n 



ai D^'U ' /, «2\l/2 U, 

"1^/^ n n A 77,^ 



5377^/2^^3/2 



n 



3/2 



Ak2 ^ 
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together with C ^ E. We now take u = (7162 + d2)v'^ and assume t; ^ 1/4, k ^ 1/16, and 

3^ 



v^{nb2 + (^2)^^^ ^ 12, so that, we have 



E ^ 64.(62 + % +.^(52 + % (18 + ^) + ^.^(62 + ^)^ 

n 

^ (62 + ^) ^64^; + (18 +^y + ^^'^ + 53Kt;3(n62 + ^2)^2^ , 

^ (62 + — ) f 64u + 18.2 + -u^ + 9k^v^ + 53Kv'\nb2 + ^2)^^^ ) , 
^ n \ 6 / 

^ (b2 + —)( 68.5v + + 9k/16 x 16 x 16 + 53k x 

^ n ^ \ 6 X 16 

^ (62 + — )(69u + 10k) ^20(62 + — ). 



12^ 
64/' 



This imphes that u{Jx,wo)'^ ^ ^ i^' ^^^^ '^^'^ ^PPly Proposition |2] Thus, by 

denoting 62 = 62 + > ei = bi + ^, and a = 69v + 10k ^ 20, we get a global upper bound: 

5 + |CK 620 + ^^(ei + esa)^ + (ei + e2a)(l + a)^/^. 

With ei + 620 ^ e2/^(KAV2/ii)(l + a), we get 

B + \C\ ^ 620 + 40^262(1 + 0)^ + 15^62(1 + 0)^/^ 

^ 62a + 62k(40 X 21 X 21/16 + 15(21)2/2) ^ ^^(egu + 2560k), 

which leads to the desired result, i.e., Eq. (14).

D Proof of Theorem 3

We follow the same proof technique as for Theorem 2 in Appendix C. We have:

Jo{w\) = Jo{wx) + q^iwx - Wo) + q^wo 

= Mwx) + q'^iwx - w^^) + q^{w^ - w^) - q'^ J^w^ J'x{w^ ) + q'^wo, 

where w^'^ is the two-step Newton iterate from wq- We have, from Eq. (l24l ). iy{Jx, w^) ^ 
A172' 



^^rZ^(<^A) ^0)^. which then implies (with Eq. Q): 



(^A-*D^(Q + A/)(^A-u^f^) ^ ^( ^KA,«;o)^ 1 ^ 



IQR^ ( 2R ^ 5l2R^i^{Jx,woY 



A3 
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which in turn implies 

Moreover, we have from the closed-form expression of : 

\q^{w^ - «^o) - ^1 ^ I tr(Q + \iy\qq^ - Q/n)\ + Xw], {Q + Xiy^q. (31) 
Finally, we have, using Eq. © from Proposition [T] 



Ai/2 



< 2[,^Q(Q + A/)-g]^/^||QV2^|b^^^^(^ (32) 



where A = — wq. 

What also needs to be shown is that $|\mathrm{tr}\, Q_\lambda(Q_\lambda + \lambda I)^{-1} - \mathrm{tr}\, Q(Q + \lambda I)^{-1}|$ is small
enough; by noting that $Q = J_0''(w_0)$, $Q_\lambda = J_0''(w_0 + v)$, and $v = \hat{w}_\lambda - w_0$, we have, using
Eq. (22) from Appendix A.2:

\iTQx{Qx + \iy^ -iTQ{Q + Xir^\ 

= \\iT[{Qx + \ir\Q-Qx){Q + \ir^]\ 
p 

^ \Y,\5j{Qx + \ir\Q - Qx){Q + \IrH^\ 

i=l 

V 

^ Ai? + A/)-i<5i||2||(QA + \IY^ khWO'l'^vh 

i=l 

P 

^ X-^/^R\\Q^/\\\2^SjQiQ + XI)-^5i = X-^/'^R\\Q^/\\\2di. (33) 



i=l 



All the terms in Eqs. (30, 31, 32, 33) that need to be added to obtain the required upper bound
are essentially the same as the ones in the proof of Theorem 2 in Appendix C (with smaller
constants). Thus the rest of the proof follows.

E Proof of Theorem 4

We follow the same proof technique as for the Lasso [15, 12, 14], i.e., we consider $\tilde{w}$
the minimizer of $\hat{J}_0(w) + \lambda s^\top w$ subject to $w_{K^c} = 0$ (which is unique because $Q_{KK}$ is
invertible), and (1) show that $\tilde{w}_K$ has the correct (non zero) signs and (2) that it is actually
the unrestricted minimum of $\hat{J}_0(w) + \lambda\|w\|_1$ over $\mathbb{R}^p$, i.e., using optimality conditions for
nonsmooth convex optimization problems P8l, that $\|[\hat{J}_0'(\tilde{w})]_{K^c}\|_\infty \leqslant \lambda$. All this will be
shown by replacing $\tilde{w}$ by the proper one-step Newton iterate from $w_0$.

Correct signs on K. We directly use Proposition 2 with the function $w_K \mapsto \hat{J}_0(w_K, 0) +
\lambda s_K^\top w_K$ (where $(w_K, 0)$ denotes the $p$-dimensional vector obtained by completing by
zeros) to obtain from Eq. (7):

{■WK - {wo)k)^ Qkk{wk - {wo)k) ^ 16(gi^ - XskY Q'^j^{qK - \sk) = 16i/^, 

as soon as z/^ = {qx — Xsk)~^ Q^^xigK — As^) ^ and thus as soon as QkQ^kIk ^ 
and X^sJ^Q'^^j^SK ^ g^- We thus have: 

\\w - WqWoo ^ \\WK - {wq)k\\2 ^ P^^^'^\\Q^k]<{'^J< ~ {'^o)k)\\2 ^ ^P^^^'^V- 



We therefore get the correct signs for the covariates indexed by K, as soon as ||^Z; — li^oHoo ^ 
mirijgi^ l(^o)iP = P^, as soon as 

max \qKQK^K(lK: sI^Q^kk^k^ ^ ™™ {^^^' 8^} ' 
Note that sJ^Q^^sk ^ \K\p^'^, thus it is implied by the following constraint: 

^^^W^'"^"^'^'^"'^' ^^^^ 

Gradient condition on K'^. We denote by the one-step Newton iterate from wq for the 
minimization of Jo{w) + Xs^w restricted to -Wi^c = 0, equal to = {wo)k + Qk\'{qk — 
Xsk)- From Eq. we get: 

(wk - wf^y Qkk{wk - w^) ^ —^[{qk - ^skY Q~^j^{qK - Xsk)]^ = ^^"^ ^ ■ 
We thus have 

, A D,,2 

p^l'^ p 

\\w-Wq\\2 ^ \\W -W^\\2 + \\WQ-W^\\2i^^yp~^^'^ 

Note that up to here, all bounds $R$ may be replaced by the maximal $\ell_2$-norm of all data points,
reduced to variables in $K$.



\W-W-\\2 ^ ^ __ ^ 1/^^ 
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In order to check the gradient condition, we compute the gradient of Jq along the direc- 
tions in K'^, to obtain for all z ^ W, using Eq. ^ and with any v such that -R||t'||2 ^ 3/2 



(zTQz)i/2 ^ ^ ^ ' R\\v\ 



2 



where Tq{w) = Jljiwo) + jQ{wo){w — wq) is the derivative of the Taylor expansion of Jq 
around wq. This implies, since diag((5) ^1/4, the following £oo-bound on the difference Jq 
and its Taylor expansion: 

\\[Jl){wo + v) - f;^iwo + v)]k4oo < {v'^Qvy/^R\\v\\2. 

We now have. 



K'^ Woo 



+ \\Tq{w )/<c - Tq{w)kA\oo + \\Tq{w)k'^ - Jo('w)i^=||oo, 
^ \\[J'o{wq) + Q{W^ - Wo)\kA\oo 

+ \\[Q[w - w^'y^KAU + R\\w - wMQ^'^[w - wo)h, 

^ \\ — QK'^ + QK''KQ~KKi(lK — ^Sk)\\oo 

^\Qk^kQkkQ'kk^^k - «)x)lloo + 3zvi?p-i/2(4i?z.V"'/' + y). 

^ \qK- - Qk'^kQk^k^QK - ><Sk)\\oo + ^WQ^clci'^K - Wk)\\2 + 
II — 1 I \ \ii ^ 16i? 2 9i? 2 



16i? 2 



Thus, in order to get || jo(^)A'<=||oo ^ \ we need 

Ikx'^ - Qk''kQ~kk^k\\oo ^ ??A/4, (36) 

and 

In terms of upper bound on A we then get: 



A ^ mm < — — rrp^fJ', —. — nTo-"- ' 



4|i^|i/2^' 41^11/2^" '64i?|i^|}' 

which can be reduced A ^ min |^-^t72/^) g/rik] }• terms of upper bound on qj^Q^j^qK 
we get: 

qIQ-kUk ^ min Ai?-2, ^!?^\ 

^K-^KK'iJ^ 1 le'^ 16 64i? J 
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which can be reduced to qJcQj/j^qx ^ min | ^A*^, '^64/? }' using the constraint on A. 

We now derive and use concentration inequalities. We first use Bernstein's inequality 
(using for all k and i, \{xi)k - 1^*1 =^ R/p^^^ and Qkk ^ 1/4), and the 

union bound to get 

/ riA^r?^ /32 \ 

n\\qK^ - Qk^kQkkIkIU ^ A77/4) ^ 2pexp —i^^-^^^ J 

^ 2pexp(^ — 

as soon as i.e., as soon as, A ^ Sp^^'^R ^, which is indeed satisfied because 

of our assumption on A. We also use Bernstein's inequality to get 



nqlQAlK >t)^ fI^WkWoo > y/^) ^ 2\K\ exp ( " J||) • 
The union bound then leads to the desired result. 

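F Proof of Theorem 5

We follow the proof technique of [16]. We have $\hat{J}_0(w) = J_0(w) - q^\top w$ for all $w$. Thus, because
$\hat{w}_\lambda$ is a minimizer of $\hat{J}_0(w) + \lambda\|w\|_1$,

$$J_0(\hat{w}_\lambda) - q^\top \hat{w}_\lambda + \lambda\|\hat{w}_\lambda\|_1 \leqslant J_0(w_0) - q^\top w_0 + \lambda\|w_0\|_1, \tag{38}$$

which implies, since $J_0(\hat{w}_\lambda) \geqslant J_0(w_0)$:

$$\lambda\|\hat{w}_\lambda\|_1 \leqslant \lambda\|w_0\|_1 + \|q\|_\infty\|\hat{w}_\lambda - w_0\|_1,$$
$$\lambda\|(\hat{w}_\lambda)_K\|_1 + \lambda\|(\hat{w}_\lambda)_{K^c}\|_1 \leqslant \lambda\|(w_0)_K\|_1 + \|q\|_\infty\big(\|(\hat{w}_\lambda)_K - (w_0)_K\|_1 + \|(\hat{w}_\lambda)_{K^c}\|_1\big).$$

If we denote by $\Delta = \hat{w}_\lambda - w_0$ the estimation error, we deduce:

$$(\lambda - \|q\|_\infty)\|\Delta_{K^c}\|_1 \leqslant (\lambda + \|q\|_\infty)\|\Delta_K\|_1.$$

If we assume $\|q\|_\infty \leqslant \lambda/2$, then, we have $\|\Delta_{K^c}\|_1 \leqslant 3\|\Delta_K\|_1$, and thus using (A5), we get
$\Delta^\top Q\Delta \geqslant \rho^2\|\Delta_K\|_2^2$. From Eq. (38), we thus get:

$$J_0(\hat{w}_\lambda) - J_0(w_0) \leqslant q^\top(\hat{w}_\lambda - w_0) - \lambda\|\hat{w}_\lambda\|_1 + \lambda\|w_0\|_1,$$
$$J_0(w_0 + \Delta) - J_0(w_0) \leqslant (\|q\|_\infty + \lambda)\|\Delta\|_1 \leqslant \frac{3\lambda}{2}\|\Delta\|_1. \tag{39}$$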
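Using Eq. (3) in Proposition 1 with $J_0$, we obtain:

$$J_0(w_0 + \Delta) - J_0(w_0) \geqslant \frac{\Delta^\top Q\Delta}{R^2\|\Delta\|_2^2}\big(e^{-R\|\Delta\|_2} + R\|\Delta\|_2 - 1\big),$$

which implies, using $\Delta^\top Q\Delta \geqslant \rho^2\|\Delta_K\|_2^2$ and Eq. (39):

$$\frac{\rho^2\|\Delta_K\|_2^2}{R^2\|\Delta\|_2^2}\big(e^{-R\|\Delta\|_2} + R\|\Delta\|_2 - 1\big) \leqslant \frac{3\lambda}{2}\|\Delta\|_1.$$

We can now use, with $s = |K|$, $\|\Delta\|_2 \leqslant \|\Delta\|_1 \leqslant 4\|\Delta_K\|_1 \leqslant 4\sqrt{s}\|\Delta_K\|_2$ to get: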

\Ak\\2 



(40) 



,^(e-^ll^ll^ + - 1) ^ ^ (V^^^M;^ ^ 24A.i?^||A|b. 



This implies using Eq. (ES, that i?||A||2 ^ itf^xlR^p^ ^ ^ a soon as RXsp'"^ ^ 1/48, 
which itself implies that ^^n^n^ (e'-^H'^H^ _^ i?||A||2 - l) ^ 1/2, and thus, fromEq. (l40l ). 



3A 

|Ak||2 ^ y X 4Vs||Ai^||2. 



The second result then follows from Eq. (39) (using Bernstein's inequality for an upper bound
on $P(\|q\|_\infty > \lambda/2)$).

G Concentration inequalities

In this section, we derive concentration inequalities for quadratic forms of bounded random
variables that extend the ones already known for Gaussian random variables [28]. The
following proposition is a simple corollary of a general concentration result on U-statistics [11].

Proposition 4 Let $y_1, \ldots, y_n$ be $n$ vectors in $\mathbb{R}^p$ such that $\|y_i\|_2 \leqslant b$ for all $i = 1, \ldots, n$
and $Y = [y_1^\top, \ldots, y_n^\top]^\top \in \mathbb{R}^{n \times p}$. Let $\varepsilon \in \mathbb{R}^n$ be a vector of zero-mean independent
random variables almost surely bounded by $1$ and with variances $\sigma_i^2$, $i = 1, \ldots, n$. Let
$S = \mathrm{Diag}(\sigma_i) Y Y^\top \mathrm{Diag}(\sigma_i)$. Then, for all $u \geqslant 0$:

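$$P\big[|\varepsilon^\top Y Y^\top \varepsilon - \mathrm{tr}\, S| \geqslant 32\,\mathrm{tr}(S^2)^{1/2}u^{1/2} + 18\lambda_{\max}(S)u + 126\,b\,(\mathrm{tr}\, S)^{1/2}u^{3/2} + 39\,b^2u^2\big] \leqslant 8e^{-u}. \tag{41}$$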
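The centering by $\mathrm{tr}\, S$ in Proposition 4 reflects the fact that $\mathbb{E}[\varepsilon^\top Y Y^\top \varepsilon] = \mathrm{tr}\, S$ for independent zero-mean $\varepsilon_i$. As a minimal sketch, this can be verified exactly for Rademacher signs (which are bounded by 1 and have unit variance, so that $S = Y Y^\top$) by enumerating all sign patterns. The matrix $Y$ below and the helper name `quad_form` are hypothetical and used only for illustration.

```python
import itertools

# Hypothetical rows y_i of Y (n = 3 vectors in R^2), for illustration only.
Y = [(1.0, 0.0), (0.5, 0.5), (-0.3, 0.8)]


def quad_form(eps):
    """eps^T Y Y^T eps, computed as || sum_i eps_i y_i ||_2^2."""
    sx = sum(e * y[0] for e, y in zip(eps, Y))
    sy = sum(e * y[1] for e, y in zip(eps, Y))
    return sx * sx + sy * sy


# Rademacher signs have variance 1, hence S = Y Y^T and tr S = sum_i ||y_i||^2.
trace_S = sum(a * a + b * b for a, b in Y)

# Exact expectation: average over all 2^n equally likely sign vectors.
values = [quad_form(eps) for eps in itertools.product((-1.0, 1.0), repeat=len(Y))]
mean_value = sum(values) / len(values)
max_deviation = max(abs(v - trace_S) for v in values)

print("E[eps^T Y Y^T eps] =", mean_value, " tr S =", trace_S)
print("largest deviation from tr S:", max_deviation)
```

The proposition then quantifies how far the quadratic form can deviate from this mean, with a deviation that grows with $u$ at a rate controlled by $\mathrm{tr}(S^2)$, $\lambda_{\max}(S)$, $\mathrm{tr}\, S$ and $b$.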

Proof We apply Theorem 3.4 from ifTTTl . with Ti = £«, gij{ti, tj) = yJyjUtj if \ti\, \tj\ ^ 1 
and zero otherwise. We then have (following notations from IfTTl ): 

A = maxlyj yj\ ^ b^, 



= max ^ y](2/72/i)^o-? ^ max ^yjyib'^a'^, ^ h^ii{S), 

jG{l,...,r!,| jG{l,...,n| 



3<i 
Thus (using $\varepsilon = 4$ in [11]):

Moreover, we have from Bernstein's inequality [2]:

leading to the desired result, noting that for $u \leqslant \log(8)$, the bound is trivial. ■

We can apply to our setting to get, withyj = ^{P+\I)^'^/'^Xi (with ||xi||2 ^ i?), leading 
to 6 = \Rn~^\~^/^ and 5 = ^ T)mg{a)X{P + \I)-^X^ Diag(cj). 

Misspecifled models. If no assumptions are made, we simply have: \raa.x{S) ^ (tr5^)^/^ ^ 

tr(5) ^ B? /\n and we get after bringing terms together: 



4lR^u ( u 



Xn 



,3/2 



+ ^ 8^ + 63- 



A \ n 



n 



3/2 



(42) 



Well-specified models In this case, P = Q and Aniax('S') ^ 1/n, tr5 = di/n, tr5^ 



q^{P + Xiy\-^ 



n 



324/^1/2 53i?(iy\3/2 ^2^2 

n n n^/2A^/2 Xn"^ 



^ 8e-". (43) 
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